Expected Error Analysis for Model Selection

Authors

  • Tobias Scheffer
  • Thorsten Joachims
Abstract

To select a good hypothesis language (or model) from a collection of possible models, one has to assess the generalization performance of the hypothesis returned by a learner that is bound to use a particular model. This paper deals with a new and very efficient way of assessing this generalization performance. We present a new analysis which characterizes the expected generalization error of the hypothesis with least training error in terms of the distribution of error rates of the hypotheses in the model. This distribution can be estimated very efficiently from the data, which immediately leads to an efficient model selection algorithm. The analysis predicts learning curves with very high precision and thus contributes to a better understanding of why and when overfitting occurs. We present empirical studies (controlled experiments on Boolean decision trees and a large-scale text categorization problem) which show that the model selection algorithm leads to error rates that are often as low as those obtained by 10-fold cross-validation (and sometimes even better). However, the algorithm is much more efficient (because the learner does not have to be invoked at all) and thus solves model selection problems with as many as a thousand relevant attributes and 12,000 examples.
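The central quantity in the abstract, the expected generalization error of the hypothesis with least training error, can be illustrated with a small simulation. The sketch below is not the paper's analysis or algorithm: it is a Monte Carlo stand-in that assumes the training errors of the hypotheses are independent binomial draws, and the function name `expected_error_of_erm`, the two toy models, and all parameter values are hypothetical choices made for illustration.

```python
import random


def expected_error_of_erm(true_errors, m, trials=200, seed=0):
    """Monte Carlo estimate of the expected true error of the hypothesis
    with least training error on m examples.

    Assumption (ours, for illustration): each hypothesis's training error
    count is an independent Binomial(m, true_error) draw.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        best_wrong = None
        best_true = []
        for e in true_errors:
            # Simulate the number of training mistakes of this hypothesis.
            wrong = sum(rng.random() < e for _ in range(m))
            if best_wrong is None or wrong < best_wrong:
                best_wrong, best_true = wrong, [e]
            elif wrong == best_wrong:
                best_true.append(e)
        # Ties in training error are broken uniformly at random.
        total += rng.choice(best_true)
    return total / trials


# Two hypothetical models, each described only by the true error rates
# of its hypotheses (made-up values, not data from the paper).
small_model = [0.30 + 0.02 * i for i in range(5)]
large_model = [0.25 + 0.002 * i for i in range(100)]

for name, model in [("small", small_model), ("large", large_model)]:
    estimate = expected_error_of_erm(model, m=30)
    print(f"{name} model: estimated expected error {estimate:.3f}")
```

Model selection in this toy setting would pick the model with the lower estimate. The learner itself is never run; only the distribution of error rates is used, which is the property the abstract highlights.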

Related resources

Comparative Approach to the Backward Elimination and Forward Selection Methods in Modeling the Systematic Risk Based on the ARFIMA-FIGARCH Model

The present study aims to model systematic risk using financial and accounting variables. Accordingly, the data for 174 companies in Tehran Stock Exchange are extracted for the period of 2006 to 2016. First, the systematic risk index is estimated using the ARFIMA-FIGARCH model. Then, based on the research background, 35 influential financial and accounting variables are simultaneously used with t...

Robust model selection using fast and robust bootstrap

Robust model selection procedures control the undue influence that outliers can have on the selection criteria by using both robust point estimators and a bounded loss function when measuring either the goodness-of-fit or the expected prediction error of each model. Furthermore, to avoid favoring over-fitting models, these two measures can be combined with a penalty term for the size of the mod...

A New Approach to Project Risk Responses Selection with Inter-dependent Risks

Risks are natural and inherent characteristics of major projects. Risks are usually considered independently in analysis of risk responses. However, most risks are dependent on each other, and independent risks are rare in the real world. This paper proposes a model for proper risk response selection from the responses portfolio with the purpose of optimization of defined criteria for projects. Th...

Wavelength region selection and spectrophotometric simultaneous determination of naphthol isomers based on net analyte signal

Naphthol isomers were simultaneously and spectrophotometrically determined in wastewater, using a model based on net analyte signal (NAS). The calibration method used is a variation of the original hybrid linear analysis method as proposed by Goicoechea and Olivieri (HLA/GO). Owing to spectral interferences, the simultaneous determination of mixtures of naphthol isomers, using a spectrophotomet...

Application of Genetic Algorithms for Pixel Selection in MIA-QSAR Studies on Anti-HIV HEPT Analogues for New Design Derivatives

Quantitative structure-activity relationship (QSAR) analysis has been carried out with a series of 107 anti-HIV HEPT compounds with antiviral activity, which was performed by chemometrics methods. Bi-dimensional images were used to calculate some pixels and multivariate image analysis was applied to QSAR modelling of the anti-HIV potential of HEPT analogues by means of multivariate calibration,...

Publication date: 1999